Modeling the topology of protein interaction networks 
A major issue in biology is the understanding of the interactions between proteins. These interactions
can be described by a network, where the proteins are modeled by nodes and the interactions by edges.
The origin of these protein networks is not well understood yet. Here we present a two-step model,
which generates clusters with the same topological properties as networks for protein-protein
interactions, namely, the same degree distribution, cluster size distribution, clustering coefficient,
and shortest path length. The biological and model networks are not scale-free but exhibit small-world
features. The model allows the fitting of different biological systems by tuning a single parameter.

PACS numbers: 64.60.aq, 89.75.Fb, 87.15.km, 87.23.Kg 
I. INTRODUCTION

The study of complex networks in biology promises new fruitful insights about the functionality
of genes and proteins [TH5]. Since the interactions between proteins determine their functionality,
the properties and the origin of the interaction networks have attracted much attention [BHS].
They consist of protein complexes, which are connected in a large, constantly evolving cluster [S].
The analysis of hundreds of protein complexes has established that some of the relevant structural
features are the contact area, the shape of the interfaces, the complementarity of surface shapes,
and the interaction-mediating forces. Although not all interactions have been discovered yet,
numerous studies have been performed and many data sets are available [HH - HU]. One important
outcome of these studies is that protein networks vary widely in the number of nodes and edges
and in the average connectivity degree (Table I). They appear not to be scale-free: the
distribution of connectivity degrees is not a power law, although it stretches over a significant
number of orders of magnitude. Moreover, they do not consist of one single cluster; in addition
to a large component, many small clusters of interactions are also detected.

These results suggest that the specific features of biological networks express different
underlying mechanisms than do other networks, like social interaction networks or the internet
[TTJ [TH]. In fact, it has been speculated that gene duplication is the dominant evolutionary
force in shaping biological networks [TUJ [19]. Conversely, non-biological networks are typically
driven by additive growth processes [TS] such as, for instance, preferential attachment [20],
but many other mechanisms like rewiring [21], aging [22], or fitness [23] have been investigated.
However, none of these models can reproduce the full topology of protein networks like, for
instance, the emergence of isolated clusters found in real biological networks (Fig. 1).

Organism                          N        M         ⟨k⟩     α
Nocardia farcinica (NF)           3582     12045     6.7     Off
Bradyrhizobium japonicum (BJ)     4883     19261     7.9     1.00
Aeromonas hydrophila (AH)         2708     9050      6.7     0.75
Citrobacter koseri (CK)           3373     8212      4.9     0.50
Escherichia coli (EC)             3204     13091     8.2     0.75
Pseudomonas aeruginosa (PA)       3794     14252     7.5     0.75
Serratia proteamaculans (SP)      3373     8187      4.9     0.75
Vibrio cholerae (VC)              2512     8612      6.9     1.00
Saccharomyces cerevisiae (SC)     4771     54607     22.9    1.75
Homo sapiens (HS)                 11102    136930    24.7    1.75

TABLE I. List of organisms from the STRING 8.2 data set [26] investigated here. Columns report
the number of nodes N, the number of edges M, the average degree ⟨k⟩, and the value of the model
parameter α used here. Edges between pairs of proteins represent an 80% reliability of protein
interaction. NF belongs to Actinobacteria, BJ to Alphaproteobacteria, and all other bacteria
belong to the Gammaproteobacteria class.
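As a quick consistency check on Table I, the average degree of an undirected network follows from the node and edge counts as ⟨k⟩ = 2M/N, since every edge contributes to the degree of two nodes. The short Python sketch below recomputes ⟨k⟩ for three of the listed organisms. The names `average_degree` and `table_i` are illustrative only and do not come from the original work.

```python
# Recompute the average degree <k> = 2M/N from the N and M columns of Table I.
# The names below (average_degree, table_i) are illustrative only.

def average_degree(n_nodes: int, n_edges: int) -> float:
    """Average degree of an undirected graph: each edge adds 1 to two degrees."""
    return 2.0 * n_edges / n_nodes

# (N, M, reported <k>) for three rows of Table I.
table_i = {
    "AH": (2708, 9050, 6.7),
    "EC": (3204, 13091, 8.2),
    "HS": (11102, 136930, 24.7),
}

for name, (n, m, k_reported) in table_i.items():
    k = average_degree(n, m)
    print(f"{name}: <k> = {k:.2f} (Table I: {k_reported})")
```

Rounded to one decimal place, the recomputed values agree with the ⟨k⟩ column of the table.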

Here we propose a different model, which reproduces many topological properties of protein
interaction networks. We do not consider the details of the biochemical mechanisms at the
basis of each interaction, nor do we classify proteins in classes as in other approaches
[3, 9, 24]. Instead, we follow a simple probabilistic approach.



II. THE MODEL 

FIG. 1. (Color online) The topology of (left) the AH network and (right) the model network
with α = 0.75. The largest clusters are drawn in the center and the smaller clusters on the
border. The color code and the size (from small to large) represent the degree of each site
on a logarithmic scale: blue k < 3, green 3 < k < 5, cyan 5 < k < 10, yellow 10 < k < 21,
red 21 < k < 43 and purple k > 43 [31].

The procedure starts with a fully connected network of N sites and M = N(N - 1)/2 edges.
The number of nodes is equal to the number of nodes of the biological network considered,
N = N_bio. The evolution is performed according to the following steps:
(i) Choose at random a node i.
(ii) Choose at random an edge e_ij and remove it with a probability p_ij related to the
degree k_j of the neighbor j of node i:

$$
p_{ij} = \frac{P_j}{N_i}
\quad \text{with} \quad
P_j =
\begin{cases}
k_j^{-\alpha}, & k_j > 1 \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

and with N_i the normalization N_i = Σ_z P_z, where the sum runs over the k_i neighbors z of
node i. α ≥ 0 is the only free parameter of the model and controls the relative robustness
of edges belonging to highly connected nodes with respect to edges of sites with low k.
This rule implies that "the poor get poorer." The case α = 0 implies that all sites have
the same probability to lose edges and the process reduces to a random depletion.
(iii) Repeat this procedure for another node i until the number of edges M in the network
equals the number of nodes N.
(iv) Choose at random two nodes i and j. Add an edge between these nodes with probability

$$
p_{i,j} = \frac{[N_c(i,j)]^2}{k_i k_j},
\tag{2}
$$

where N_c(i,j) is the number of neighbors that nodes i and j have in common. This step
supposes that, if two given nodes are able to interact with the same nodes, they have a
high probability to interact with each other.
(v) Repeat this procedure for another random pair of nodes i and j until the number of
edges M in the network equals the number of edges of the modeled biological network, M_bio.
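For illustration only, steps (i)-(v) can be sketched in Python as follows. This is not the authors' implementation, and several choices are assumptions made here: the network is stored as a dictionary of neighbor sets; a node i with fewer than two edges is skipped in the depletion phase so that no node becomes isolated (in line with the smallest allowed degree k = 1 mentioned in the Results); the edge in step (ii) is drawn uniformly among the edges of node i before the removal test; and a `max_attempts` guard stops the second phase if no further edge can be added. All function names are hypothetical.

```python
import random

def count_edges(adj):
    """Number of undirected edges in a dictionary of neighbor sets."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2

def deplete(n, alpha, rng):
    """Steps (i)-(iii): start fully connected, remove edges until M == N."""
    adj = {i: set(range(n)) - {i} for i in range(n)}
    m = n * (n - 1) // 2
    while m > n:
        i = rng.randrange(n)                      # step (i): random node
        if len(adj[i]) < 2:                       # assumption: never isolate node i
            continue
        # P_z = k_z^(-alpha) for neighbors with k_z > 1, and 0 otherwise (Eq. 1).
        weights = {z: len(adj[z]) ** -alpha if len(adj[z]) > 1 else 0.0
                   for z in adj[i]}
        norm = sum(weights.values())              # normalization N_i
        if norm == 0.0:
            continue
        j = rng.choice(sorted(adj[i]))            # step (ii): random edge e_ij
        if rng.random() < weights[j] / norm:      # remove with probability p_ij
            adj[i].remove(j)
            adj[j].remove(i)
            m -= 1
    return adj

def add_similar(adj, m_target, rng, max_attempts=1_000_000):
    """Steps (iv)-(v): link nodes that share neighbors until M == m_target."""
    n = len(adj)
    m = count_edges(adj)
    attempts = 0
    while m < m_target and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(n), 2)            # step (iv): random pair
        if j in adj[i] or not adj[i] or not adj[j]:
            continue
        common = len(adj[i] & adj[j])             # N_c(i, j)
        p = common ** 2 / (len(adj[i]) * len(adj[j]))   # Eq. (2)
        if rng.random() < p:
            adj[i].add(j)
            adj[j].add(i)
            m += 1
    return adj

def two_step_model(n, m_target, alpha, seed=0):
    """Run the depletion phase followed by the similarity phase."""
    rng = random.Random(seed)
    return add_similar(deplete(n, alpha, rng), m_target, rng)

# Small toy run; real networks in Table I have thousands of nodes.
net = two_step_model(n=40, m_target=90, alpha=0.75)
print(count_edges(net), min(len(nbrs) for nbrs in net.values()))
```

The sketch favors readability over speed: starting from a fully connected network costs of the order of N^2 edge removals, so networks of the size listed in Table I would need a more efficient data structure.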

These rules are based on the assumption that the evolution is controlled by two basic
mechanisms: (i) preferential depletion: the lower the node degree, the lower the probability
to maintain interactions [25]; (ii) similarity: the more common neighbors two nodes share,
the higher is the probability to have an interaction.
The first mechanism is important for the emergence of isolated clusters and a maximal degree,
while the second one is necessary to generate networks with a high clustering coefficient
and assortativity. It is interesting to notice that the implementation of the depletion
mechanism alone generates scale-free networks and does not reproduce the topology of
protein-protein interaction networks [25].

FIG. 2. (Color online) The degree distribution p(k) for AH (circles), BJ (triangles),
CK (stars) and HS (squares) and their corresponding model networks (lines) with α obtained
from Table I. Star, triangle, and square data sets are shifted vertically by factors of
0.5, 2, and 5, respectively, for better visibility.
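Whether a network is scale-free is judged from its degree distribution p(k), the quantity plotted in Fig. 2. A minimal Python sketch of how p(k) can be measured is given below, assuming the network is stored as a dictionary that maps each node to the set of its neighbors. The function name `degree_distribution` and the toy network are illustrative only.

```python
from collections import Counter

def degree_distribution(adj):
    """Return {k: p(k)}, the fraction of nodes that have degree k."""
    degrees = [len(neighbors) for neighbors in adj.values()]
    counts = Counter(degrees)
    n = len(degrees)
    return {k: counts[k] / n for k in sorted(counts)}

# Toy example: a star with one hub (node 0) and four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
p = degree_distribution(star)
print(p)  # prints {1: 0.8, 4: 0.2}
```

On a log-log plot, a power law appears as a straight line, whereas an exponential cutoff bends the tail downward.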



III. RESULTS 

The biological networks are obtained from the STRING 8.2 data set [26], where a combined
score of 80% is used to decide whether two proteins interact. We tested our algorithm on
the ten different biological networks listed in Table I. For each organism we determine a
value of the parameter α which provides a good fit (Table I) for the degree distribution.
All results for model networks are averages over 100 independent runs for bacteria and
10 runs for the other two networks. In Fig. 1 we show an example for a biological network
and the corresponding model network, with the same number of nodes and edges and α = 0.75.
Both networks have one large cluster with dangling ends, shown in the center of both graphs.
Moreover, both networks have a large number of small clusters, placed on the border of
each network.

FIG. 3. (Color online) Frequency f(S) of finding a cluster with a given number of nodes S_n
for CK (circles), EC (triangles), and VC (stars) and with a given number of edges S_m (inset)
for AH (circles), BJ (triangles), and VC (stars), and their corresponding model networks
(lines). Top and bottom data sets are shifted vertically by one decade, upward and downward.

FIG. 4. (Color online) Visualization of small-world properties of biological networks.
Average shortest path length l_k of sites of degree k versus k for AH (circles),
EC (triangles), and VC (stars) and clustering coefficient C_k (inset) of sites of degree k
versus k for AH (circles), BJ (triangles), and EC (stars), and their corresponding numerical
networks (lines). Top and bottom data sets for l_k are shifted vertically by factors of
2 and 0.5.
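The two per-degree quantities named in the caption of Fig. 4, the clustering coefficient C_k and the average shortest path length l_k, can be computed along the following lines. This Python sketch rests on assumptions made here rather than on definitions stated in the text: the clustering coefficient of a node with fewer than two neighbors is set to zero, and the path length of a node is the mean breadth-first-search distance to all other nodes in its own cluster. All names are hypothetical.

```python
from collections import defaultdict, deque

def local_clustering(adj, i):
    """Fraction of pairs of neighbors of i that are themselves connected."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0                      # convention assumed here for k < 2
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2.0 * links / (k * (k - 1))

def mean_distance(adj, source):
    """Mean BFS distance from source to every other node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    reached = len(dist) - 1
    return sum(dist.values()) / reached if reached else 0.0

def average_by_degree(adj, node_value):
    """Average node_value(adj, i) over all nodes i that share the same degree k."""
    total, count = defaultdict(float), defaultdict(int)
    for i in adj:
        k = len(adj[i])
        total[k] += node_value(adj, i)
        count[k] += 1
    return {k: total[k] / count[k] for k in sorted(count)}

# Toy network: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c_k = average_by_degree(toy, local_clustering)   # clustering coefficient C_k
l_k = average_by_degree(toy, mean_distance)      # shortest path length l_k
print(c_k)
print(l_k)
```

In the toy network the two nodes of degree 2 sit in a closed triangle and have C_k = 1, while the pendant node of degree 1 has the largest mean distance to the rest of the cluster.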
For both networks highly connected nodes are placed in the largest cluster, whereas small
clusters are made of low-degree nodes. Since visual inspection of the topology is not a
quantitative criterion to decide whether two networks are similar, we calculate some
fundamental properties characterizing the connectivity and the structure of the two networks.
The model has by construction the same numbers of nodes N and of edges M as the biological
one and therefore the average degrees per node ⟨k⟩ are exactly the same. To provide more
information on the connectivity level of the two networks, we measure first the degree
distribution. In Fig. 2 we show the degree distribution of different biological networks and
their numerical counterparts. The biological networks are not scale-free, and the numerical
data reproduce the biological data very well by tuning the parameter α. We observe that the
value of the exponent α controls the maximum degree and the exponential cutoff of the
distribution. For α = 0 the exponential cutoff is at k = 1 and therefore the degree
distribution is a pure exponential. By increasing α, the range of the initial regime
increases and the exponential cutoff moves toward larger k values. To tune the parameter,
we compare the tail of the degree distribution for different α values and choose the one
which fits best. In the procedure the smallest allowed degree is k = 1; the model then
generates one large network and many small clusters, as in biological systems. We
characterize this complex structure by evaluating the cluster size distribution. The cluster
size is defined in terms of both the number of nodes, S_n, and the number of edges, S_m,
belonging to the cluster. Figure 3 shows the cluster size distributions for different
biological and numerical networks. Both distributions exhibit a regime consistent with a
power law with an exponent ≈ −4.4 for the size
in terms of sites, and an exponent ≈ −2.7 for the size in terms of edges. The faster decay
found for the first distribution suggests that the structure is highly clustered, as will be
confirmed later. Furthermore, in most cases the size of the largest connected cluster is
comparable (Table II). Interestingly, numerical data for f(S_m) also reproduce the
fluctuations at small sizes observed in biological data. These are not the effect of
statistical noise, but measure the relative weight of the population of clusters with few
edges, whose patterns can be simply identified.

        S_n^max   S_m^max   C_max   l_max     S_n^max   S_m^max   C_max   l_max
AH      1785      7986      0.5     5.9       1996      8390      0.65    5.08
BJ      2807      16453     0.5     6.2       3551      18043     0.61    4.85
CK      2032      6609      0.5     8.1       2471      7398      0.59    5.81
EC      2677      12620     0.5     6.2       2354      12250     0.71    4.92
NF      1683      9435      0.5     6.8       2569      11257     0.49    4.33
PA      2613      13024     0.5     7.3       2778      13263     0.68    5.09
SP      1778      5911      0.5     6.4       2484      7468      0.54    5.33
VC      1717      7726      0.5     5.6       1842      8028      0.60    4.74
SC      4711      54570     0.4     3.7       3351      53012     0.81    3.89
HS      10890     136799    0.4     3.9       7864      133576    0.66    4.15

TABLE II. Properties of the largest connected cluster for the biological networks and their
model counterparts: the number of nodes S_n^max, the number of edges S_m^max, the average
clustering coefficient C_max, and the shortest path length l_max. The error bars are 1%, 2%,
4% and 2%, respectively.
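The cluster sizes S_n and S_m, their frequencies f(S), and the size of the largest cluster reported in Table II can be obtained from a connected-component search. The Python sketch below is one possible way to do this; it assumes a dictionary-of-neighbor-sets representation and takes f(S) to be the fraction of clusters of size S, a normalization chosen here for illustration. All names are hypothetical.

```python
from collections import Counter

def clusters(adj):
    """Return the connected clusters of the network as a list of node sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:                       # depth-first traversal
            u = stack.pop()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    stack.append(v)
        seen |= component
        result.append(component)
    return result

def cluster_sizes(adj):
    """Return a list of (S_n, S_m) pairs, one per cluster."""
    sizes = []
    for component in clusters(adj):
        s_n = len(component)
        s_m = sum(len(adj[u]) for u in component) // 2   # each edge counted twice
        sizes.append((s_n, s_m))
    return sizes

def frequency(values):
    """Return {S: f(S)}, the fraction of clusters that have size S."""
    counts = Counter(values)
    return {s: counts[s] / len(values) for s in sorted(counts)}

# Toy network: a triangle with a pendant node, plus two isolated pairs.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2},
       4: {5}, 5: {4}, 6: {7}, 7: {6}}
sizes = cluster_sizes(toy)
f_nodes = frequency([s_n for s_n, _ in sizes])   # f(S_n)
f_edges = frequency([s_m for _, s_m in sizes])   # f(S_m)
largest = max(sizes)                             # (S_n^max, S_m^max)
print(f_nodes, f_edges, largest)
```

Here the largest cluster has four nodes and four edges, and the two isolated pairs account for two thirds of all clusters.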
The level of connectivity in the system is measured by the 
average clustering coefficient of nodes of degree k and the 
average shortest path between nodes of degree k (Fig. 4).
Both quantities vary smoothly with k for biological and 
numerical data. Both the model and biological networks 
are highly clustered. Moreover, biological data show that 
the average shortest path length slowly increases with k 
for low connectivity degrees and then reaches a fairly 
stable value for a wide range of k, in agreement with 
numerical data. This result suggests that the model network reproduces not only the
distribution of connectivity degrees, but also the relative position in the network of nodes
with the same k value. Moreover, the high value of the clustering coefficient and the small
shortest path length suggest that biological and model networks have small-world properties
[27]. Finally, the average clustering coefficient C_max and the average shortest path length
l_max evaluated for the largest cluster show a very weak dependence on the cluster size S^max
and exhibit (Table II) a good agreement between biological and model data.
A further confirmation that our model captures the structure of the network at both a global
and local level is given by the evaluation of the average degree of the neighbors of a site
of degree k, k_nn(k) (Fig. 5). This quantity increases with the node degree as k^{0.67±0.02}
for biological networks, and k^{0.61±om} for numerical data. This scaling behavior suggests
that highly connected nodes tend to be connected with each other.

Finally we notice that, for each system, topological properties are very stable with respect
to changes in the fitting parameter and the calculation of the similarity. Even if the fitting
value of α is changed by ±0.25 or the similarity rule is modified [e.g., using
p_{i,j} = N_c(i,j)/(k_i + k_j)], the topological properties exhibit similar behavior. It is
also possible to infer the α value by analysis of only 75% of the entire protein data set.

From the statistical point of view our model seems to be a good candidate for modeling the
topology of protein interaction networks. However, the ingredients we implement are not well
established for protein interaction networks, although they are present in other biological
systems. Stem cells are an example of depletion. When a stem cell specializes and becomes a
particular cell (a red blood cell, a muscle cell, or even a neuron) it loses the ability to
interact with cells from other types [28]. Moreover, the similarity concept can be
interpreted as the establishment of interacting protein families [29, 30].

FIG. 5. (Color online) Average nearest neighbor degree k_nn(k) of nodes of degree k versus k
for AH (circles), PA (triangles) and VC (stars) and their corresponding model networks
(lines). Top and bottom data sets are shifted vertically by a factor of 2 and 0.5.

IV. DISCUSSION



In conclusion, we present a statistical model, which reproduces surprisingly well many
topological properties of protein interaction networks. The model is based on a twofold
mechanism for evolution, namely, preferential depletion and similarity. By fitting a single
parameter, we are able to generate networks that reproduce protein interaction networks for
different bacteria as well as Saccharomyces cerevisiae and Homo sapiens. We wish to stress
that not only do the largest clusters exhibit the same connectivity properties but also the
small-cluster distributions show very good agreement between biological and model data. The
clustering coefficient and the average path length suggest that highly connected nodes are
placed in the largest cluster and preferentially connected to nodes with high degree. The
systematic analysis of the network structure for a number of biological systems indicates
that protein interaction networks are not scale-free but rather exhibit small-world
properties. Further research should be performed to better understand the origin of this
dual mechanism in protein interaction networks.